1 Introduction
- Overview and Motivation
- Related Work
- Research questions
2 Testing that R and Python work
#> [1] "hello"
#> 30.0
3 Data
- Sources
- Description
- Wrangling/cleaning
- Spotting mistakes and missing data (could be part of EDA too)
- Listing anomalies and outliers (could be part of EDA too)
3.1 Main dataset Cleaning
#> price number_of_rooms address canton property_type floor year_category
#> 1 1800000 65 1844 Villeneuve VD Vaud Apartment eg 0-1919
#> 2 1980000 55 1820 Montreux Vaud Apartment eg 0-1919
#> 3 488000 35 1882 Gryon Vaud Apartment eg 0-1919
#> 4 1755000 7 1820 Montreux Vaud Apartment eg 0-1919
#> 5 650000 25 1815 Clarens Vaud Apartment eg 0-1919
#> 6 1490000 45 1260 Nyon Vaud Apartment eg 0-1919
3.2 Creating Variable zip_code and merging with AMTOVZ_CSV_LV95
#> price number_of_rooms address canton property_type floor year_category zip_code
#> 1 1800000 65 1844 Villeneuve VD Vaud Apartment eg 0-1919 1844
#> 2 1980000 55 1820 Montreux Vaud Apartment eg 0-1919 1820
#> 3 488000 35 1882 Gryon Vaud Apartment eg 0-1919 1882
#> 4 1755000 7 1820 Montreux Vaud Apartment eg 0-1919 1820
#> 5 650000 25 1815 Clarens Vaud Apartment eg 0-1919 1815
#> 6 1490000 45 1260 Nyon Vaud Apartment eg 0-1919 1260
#> Ortschaftsname PLZ Zusatzziffer Gemeindename BFS.Nr Kantonskürzel E N Sprache Validity
#> 1 Aeugst am Albis 8914 0 Aeugst am Albis 1 ZH 2679403 1235842 de 2008-07-01
#> 2 Aeugstertal 8914 2 Aeugst am Albis 1 ZH 2679815 1237404 de 2008-07-01
#> 3 Zwillikon 8909 0 Affoltern am Albis 2 ZH 2675280 1238108 de 2008-07-01
#> 4 Affoltern am Albis 8910 0 Affoltern am Albis 2 ZH 2676852 1236930 de 2008-07-01
#> 5 Bonstetten 8906 0 Bonstetten 3 ZH 2677412 1241078 de 2008-07-01
#> 6 Sihlbrugg 6340 4 Hausen am Albis 4 ZH 2686082 1230649 de 2008-07-01
#> City zip_code Canton_code
#> 1 Aeugst am Albis 8914 ZH
#> 2 Aeugstertal 8914 ZH
#> 3 Zwillikon 8909 ZH
#> 4 Affoltern am Albis 8910 ZH
#> 5 Bonstetten 8906 ZH
#> 6 Sihlbrugg 6340 ZH
#> zip_code price number_of_rooms
#> 1 25 2200000 10
#> 2 25 2200000 65
#> 3 26 1995000 75
#> 4 26 870490 45
#> 5 322 870000 25
#> 6 322 1295770 45
#> 2253 1200 2450000 6
#> 2254 1200 982130 45
#> 11886 1919 2535730 55
#> 11887 1919 230000 15
#> 11888 1919 1415380 35
#> 11889 1919 1043260 45
#> 11890 1919 2535730 55
#> 17993 2500 1050000 45
#> 17994 2500 1100000 5
#> 17995 2500 887500 55
#> 17996 2500 870500 45
#> 17997 2500 1176820 45
#> 17998 2500 1159550 35
#> 17999 2500 1927050 45
#> 18000 2500 892500 45
#> 18001 2500 887500 45
#> 18002 2500 420000 45
#> 18003 2500 877500 45
#> 18004 2500 885500 55
#> 18005 2500 872500 45
#> 19603 3000 1448610 45
#> 19604 3000 1515060 45
#> 19605 3000 956880 45
#> 19606 3000 1222680 35
#> 19607 3000 1448610 45
#> 19608 3000 1448610 45
#> 19609 3000 1515060 45
#> 19610 3000 820000 55
#> 19611 3000 1222680 35
#> 19612 3000 1590000 55
#> 19613 3000 1448610 45
#> 27169 4000 2100000 65
#> 27170 4000 975000 45
#> 30708 5201 963520 45
#> 33490 6511 584760 3
#> 33927 6547 19935000 55
#> 35207 6602 270000 15
#> 35208 6602 3721200 55
#> 35209 6602 3721200 55
#> 35210 6604 2644710 35
#> 35211 6604 2644710 35
#> 35212 6604 1142940 45
#> 35213 6604 610000 25
#> 35214 6604 810690 35
#> 35215 6604 860000 35
#> 35216 6604 917010 45
#> 35217 6604 1010040 45
#> 40817 6901 3628170 45
#> 40818 6911 877140 55
#> 40819 6911 810690 45
#> 40820 6911 730950 45
#> 40821 6911 465150 35
#> 42848 7133 2246010 35
#> 42861 7135 3575010 65
#> 43231 8000 1295770 35
#> 43232 8000 2100000 45
#> 43233 8000 2495000 55
#> 44144 8238 739000 35
#> 44145 8238 739000 35
#> 44146 8238 716000 35
#> 44147 8238 716000 35
#> 44148 8238 325600 3
#> 44889 8423 2910510 45
#> 44890 8423 2804190 55
#> 47001 9002 3787650 45
#> 47621 9241 724300 35
#> address
#> 1 1000 Lausanne 25
#> 2 1000 Lausanne 25
#> 3 Lausanne 26, 1000 Lausanne 26
#> 4 1000 Lausanne 26
#> 5 Via Cuolm Liung 30d, 7032 Laax GR 2
#> 6 Via Murschetg 29, 7032 Laax GR 2
#> 2253 1200 Genève
#> 2254 Chemin des pralets, 74100 Etrembières, 1200 Genève
#> 11886 1919 Martigny
#> 11887 1919 Martigny
#> 11888 1919 Martigny
#> 11889 1919 Martigny
#> 11890 1919 Martigny
#> 17993 Hohlenweg 11b, 2500 Biel/Bienne
#> 17994 2500 Biel/Bienne
#> 17995 2500 Biel/Bienne
#> 17996 2500 Biel/Bienne
#> 17997 2500 Biel/Bienne
#> 17998 2500 Biel/Bienne
#> 17999 2500 Bienne
#> 18000 2500 Biel/Bienne
#> 18001 2500 Biel/Bienne
#> 18002 2500 Biel/Bienne
#> 18003 2500 Biel/Bienne
#> 18004 2500 Biel/Bienne
#> 18005 2500 Biel/Bienne
#> 19603 3000 Bern
#> 19604 3000 Bern
#> 19605 3000 Bern
#> 19606 3000 Bern
#> 19607 3000 Bern
#> 19608 3000 Bern
#> 19609 3000 Bern
#> 19610 3000 Bern
#> 19611 3000 Bern
#> 19612 3000 Bern
#> 19613 3000 Bern
#> 27169 4000 Basel
#> 27170 4000 Basel
#> 30708 5201 Brugg AG
#> 33490 6511 Cadenazzo
#> 33927 Augio 1F, 6547 Augio
#> 35207 6602 Muralto
#> 35208 6602 Muralto
#> 35209 6602 Muralto
#> 35210 6604 Solduno
#> 35211 6604 Solduno
#> 35212 6604 Solduno
#> 35213 6604 Solduno
#> 35214 6604 Solduno
#> 35215 6604 Solduno
#> 35216 6604 Locarno
#> 35217 6604 Locarno
#> 40817 6901 Lugano
#> 40818 6911 Campione d'Italia
#> 40819 6911 Campione d'Italia
#> 40820 6911 Campione d'Italia
#> 40821 6911 Campione d'Italia
#> 42848 Inder Platenga 34, 7133 Obersaxen
#> 42861 7135 Fideris
#> 43231 8000 Zürich
#> 43232 8000 Zürich
#> 43233 8000 Zürich
#> 44144 8238 Büsingen am Hochrhein
#> 44145 8238 Büsingen am Hochrhein
#> 44146 Junkerstrasse 85, 8238 Büsingen am Hochrhein
#> 44147 Junkerstrasse 85, 8238 Büsingen am Hochrhein
#> 44148 Stemmerstrasse 14, 8238 Büsingen am Hochrhein
#> 44889 Chüngstrasse 48, 8423 Embrach
#> 44890 Chüngstrasse 60, 8423 Embrach
#> 47001 6900 Lugano 2 Paradiso Caselle
#> 47621 9241 Kradolf
#> canton property_type floor year_category City
#> 1 Vaud Single house 1919-1945 <NA>
#> 2 Vaud Villa 2006-2010 <NA>
#> 3 Vaud Villa 1961-1970 <NA>
#> 4 Vaud Apartment noteg 2016-2024 <NA>
#> 5 Grisons Apartment eg 2016-2024 <NA>
#> 6 Grisons Apartment noteg 2011-2015 <NA>
#> 2253 Geneva Bifamiliar house 1981-1990 <NA>
#> 2254 Geneva Bifamiliar house 2016-2024 <NA>
#> 11886 Valais Attic flat noteg 2016-2024 <NA>
#> 11887 Valais Apartment eg 2016-2024 <NA>
#> 11888 Valais Apartment noteg 2016-2024 <NA>
#> 11889 Valais Apartment noteg 2016-2024 <NA>
#> 11890 Valais Apartment noteg 2016-2024 <NA>
#> 17993 Bern Single house 2001-2005 <NA>
#> 17994 Bern Single house 2001-2005 <NA>
#> 17995 Bern Single house 2016-2024 <NA>
#> 17996 Bern Single house 2016-2024 <NA>
#> 17997 Bern Villa 2016-2024 <NA>
#> 17998 Bern Villa 2016-2024 <NA>
#> 17999 Bern Single house 2016-2024 <NA>
#> 18000 Bern Single house 2016-2024 <NA>
#> 18001 Bern Single house 2016-2024 <NA>
#> 18002 Bern Apartment noteg 1971-1980 <NA>
#> 18003 Bern Single house 2016-2024 <NA>
#> 18004 Bern Single house 2016-2024 <NA>
#> 18005 Bern Single house 2016-2024 <NA>
#> 19603 Bern Apartment eg 2016-2024 <NA>
#> 19604 Bern Apartment eg 2016-2024 <NA>
#> 19605 Bern Apartment eg 2016-2024 <NA>
#> 19606 Bern Apartment noteg 2016-2024 <NA>
#> 19607 Bern Apartment noteg 2016-2024 <NA>
#> 19608 Bern Apartment eg 2016-2024 <NA>
#> 19609 Bern Apartment eg 2016-2024 <NA>
#> 19610 Bern Apartment noteg 2016-2024 <NA>
#> 19611 Bern Duplex noteg 2016-2024 <NA>
#> 19612 Bern Apartment noteg 1991-2000 <NA>
#> 19613 Bern Roof flat noteg 2016-2024 <NA>
#> 27169 Basel-Stadt Villa 2016-2024 <NA>
#> 27170 Basel-Stadt Single house 2016-2024 <NA>
#> 30708 Aargau Apartment noteg 2016-2024 <NA>
#> 33490 Ticino Apartment noteg 2016-2024 <NA>
#> 33927 Grisons Single house 2016-2024 <NA>
#> 35207 Ticino Apartment eg 1961-1970 <NA>
#> 35208 Ticino Single house 1981-1990 <NA>
#> 35209 Ticino Single house 1981-1990 <NA>
#> 35210 Ticino Attic flat noteg 2011-2015 <NA>
#> 35211 Ticino Apartment noteg 2011-2015 <NA>
#> 35212 Ticino Apartment noteg 2016-2024 <NA>
#> 35213 Ticino Apartment noteg 2016-2024 <NA>
#> 35214 Ticino Apartment noteg 2016-2024 <NA>
#> 35215 Ticino Apartment noteg 2016-2024 <NA>
#> 35216 Ticino Apartment noteg 2011-2015 <NA>
#> 35217 Ticino Apartment noteg 2011-2015 <NA>
#> 40817 Ticino Attic flat noteg 2011-2015 <NA>
#> 40818 Ticino Single house 1971-1980 <NA>
#> 40819 Ticino Apartment eg 1946-1960 <NA>
#> 40820 Ticino Apartment noteg 1991-2000 <NA>
#> 40821 Ticino Apartment noteg 1946-1960 <NA>
#> 42848 Grisons Single house 2006-2010 <NA>
#> 42861 Grisons Single house 0-1919 <NA>
#> 43231 Zurich Single house 2016-2024 <NA>
#> 43232 Zurich Apartment noteg 2016-2024 <NA>
#> 43233 Zurich Apartment noteg 0-1919 <NA>
#> 44144 Schaffhausen Apartment eg 2016-2024 <NA>
#> 44145 Schaffhausen Attic flat eg 2016-2024 <NA>
#> 44146 Schaffhausen Attic flat noteg 2016-2024 <NA>
#> 44147 Schaffhausen Apartment noteg 2016-2024 <NA>
#> 44148 Schaffhausen Apartment noteg 1961-1970 <NA>
#> 44889 Zurich Single house 2016-2024 <NA>
#> 44890 Zurich Bifamiliar house 2016-2024 <NA>
#> 47001 Ticino Apartment noteg 2011-2015 <NA>
#> 47621 Thurgau Apartment noteg 1991-2000 <NA>
#> Canton_code
#> 1 <NA>
#> 2 <NA>
#> 3 <NA>
#> 4 <NA>
#> 5 <NA>
#> 6 <NA>
#> 2253 <NA>
#> 2254 <NA>
#> 11886 <NA>
#> 11887 <NA>
#> 11888 <NA>
#> 11889 <NA>
#> 11890 <NA>
#> 17993 <NA>
#> 17994 <NA>
#> 17995 <NA>
#> 17996 <NA>
#> 17997 <NA>
#> 17998 <NA>
#> 17999 <NA>
#> 18000 <NA>
#> 18001 <NA>
#> 18002 <NA>
#> 18003 <NA>
#> 18004 <NA>
#> 18005 <NA>
#> 19603 <NA>
#> 19604 <NA>
#> 19605 <NA>
#> 19606 <NA>
#> 19607 <NA>
#> 19608 <NA>
#> 19609 <NA>
#> 19610 <NA>
#> 19611 <NA>
#> 19612 <NA>
#> 19613 <NA>
#> 27169 <NA>
#> 27170 <NA>
#> 30708 <NA>
#> 33490 <NA>
#> 33927 <NA>
#> 35207 <NA>
#> 35208 <NA>
#> 35209 <NA>
#> 35210 <NA>
#> 35211 <NA>
#> 35212 <NA>
#> 35213 <NA>
#> 35214 <NA>
#> 35215 <NA>
#> 35216 <NA>
#> 35217 <NA>
#> 40817 <NA>
#> 40818 <NA>
#> 40819 <NA>
#> 40820 <NA>
#> 40821 <NA>
#> 42848 <NA>
#> 42861 <NA>
#> 43231 <NA>
#> 43232 <NA>
#> 43233 <NA>
#> 44144 <NA>
#> 44145 <NA>
#> 44146 <NA>
#> 44147 <NA>
#> 44148 <NA>
#> 44889 <NA>
#> 44890 <NA>
#> 47001 <NA>
#> 47621 <NA>
We have 144 NA values, which arise where:
- The zip code was not found in the AMTOVZ data frame
- The zip code was incorrectly isolated from the address
We removed these rows.
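The two failure modes listed above can be illustrated with a small Python sketch. It is not the extraction code used in this project: the function `extract_zip`, the rule of keeping the last standalone four-digit number, and the small `known_zips` set (copied from the first rows of the AMTOVZ table shown earlier) are assumptions made for illustration only.

```python
import re

# Assumption: a Swiss zip code appears as a standalone four-digit number.
ZIP_PATTERN = re.compile(r"\b\d{4}\b")


def extract_zip(address):
    """Return the last standalone four-digit number in the address, or None."""
    matches = ZIP_PATTERN.findall(address)
    return int(matches[-1]) if matches else None


# Hypothetical lookup set: zip codes from the first rows of the AMTOVZ table.
known_zips = {8914, 8909, 8910, 8906, 6340}

addresses = [
    "1000 Lausanne 25",
    "Via Cuolm Liung 30d, 7032 Laax GR 2",
    "Chemin des pralets, 74100 Etrembières, 1200 Genève",
    "6900 Lugano 2 Paradiso Caselle",
    "8910 Affoltern am Albis",  # hypothetical address that matches the lookup
]

for address in addresses:
    zip_code = extract_zip(address)
    # A zip code missing from the lookup would become NA after a left join.
    status = "matched" if zip_code in known_zips else "unmatched"
    print(zip_code, status, address, sep=" | ")
```

Under this rule, trailing numbers such as the `25` in `1000 Lausanne 25` or five-digit foreign postcodes are ignored. A zip code that is absent from the reference table would still end up as NA after the merge, so the two causes need to be checked separately.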
3.3 Tax data cleaning
3.3.1 Merging the two datasets
The merged dataset is used for the rest of the analysis.
3.4 Cleaning of commune data
We replaced the NAs in both the Taux de couverture social (social coverage rate) and the political (Conseil National) data.
For Taux de couverture social, the NAs carried the code “Q” = “Not indicated to protect confidentiality”. We replaced them with the average rate in Switzerland in 2019, which was 3.2%.
For the political data, the NAs carried the code “M” = “Not indicated because data was not important or applicable”. We therefore replaced them with 0.
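The replacement rules described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the column names `taux_couverture_social` and `party_share`, the helper `fill_na`, and the three example rows are hypothetical, and `None` stands in for an NA.

```python
# Swiss average rate for 2019 quoted in the text, in percent.
SWISS_AVERAGE_2019 = 3.2

# Hypothetical commune rows; None stands for an NA in the source data.
communes = [
    {"commune": "A", "taux_couverture_social": 2.1, "party_share": 12.5},
    {"commune": "B", "taux_couverture_social": None, "party_share": 30.0},
    {"commune": "C", "taux_couverture_social": 4.0, "party_share": None},
]


def fill_na(rows, column, value):
    """Return copies of the rows with None in `column` replaced by `value`."""
    return [
        {**row, column: value if row[column] is None else row[column]}
        for row in rows
    ]


# Code "Q" (confidential): use the national average.
cleaned = fill_na(communes, "taux_couverture_social", SWISS_AVERAGE_2019)
# Code "M" (not applicable): use 0.
cleaned = fill_na(cleaned, "party_share", 0.0)

for row in cleaned:
    print(row)
```

Filling with a constant keeps every commune in the dataset, at the cost of treating confidential values as if they were average.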
4 EDA
4.1 Change the path below
4.2 Histogram of prices
4.3 Histogram of prices for each property type
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
4.4 Histogram of prices for each year category
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
4.5 Histogram of prices for each canton
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
4.6 Histogram of prices for each number of rooms
Note: only prices between 0 and 500000 are shown, so some outliers do not appear.
The graph below also only shows apartments with fewer than 10 rooms (this can be changed in the code if needed).
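The filtering behind these notes can be sketched in Python as below. The constants mirror the limits stated in the notes; the bin width, the function name `histogram_counts`, and the example listings are hypothetical and are not taken from the project code.

```python
from collections import Counter

PRICE_MAX = 500_000   # upper price limit stated in the notes
ROOMS_MAX = 10        # only listings with fewer than 10 rooms are kept
BIN_WIDTH = 100_000   # hypothetical bin width for the histogram

# Hypothetical (price, number_of_rooms) pairs.
listings = [
    (230_000, 1.5),
    (420_000, 4.5),
    (488_000, 3.5),
    (1_800_000, 6.5),  # dropped: price above PRICE_MAX
    (450_000, 12),     # dropped: 10 rooms or more
]


def histogram_counts(rows):
    """Count listings per price bin after applying both filters."""
    kept = [
        price
        for price, rooms in rows
        if 0 <= price <= PRICE_MAX and rooms < ROOMS_MAX
    ]
    # Map each price to the lower edge of its bin.
    return Counter((price // BIN_WIDTH) * BIN_WIDTH for price in kept)


counts = histogram_counts(listings)
for lower_edge in sorted(counts):
    print(lower_edge, counts[lower_edge])
```

Listings outside either limit are silently excluded from the counts, which is why outliers do not appear in the plots.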
4.7 Histogram of prices with taxes (impot)
4.8 Test Regression
#>
#> Call:
#> lm(formula = price ~ number_of_rooms + canton + property_type +
#> year_category, data = properties)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -7013788 -514438 -138948 264464 21628996
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) -677158 55739 -12.15 < 2e-16 ***
#> number_of_rooms 337946 6166 54.81 < 2e-16 ***
#> cantonappenzell-ausser-rhoden -464945 126861 -3.66 0.00025 ***
#> cantonappenzell-inner-rhoden -874289 392590 -2.23 0.02596 *
#> cantonbasel-landschaft -195701 57943 -3.38 0.00073 ***
#> cantonbasel-stadt 218682 105130 2.08 0.03753 *
#> cantonbern -478376 46221 -10.35 < 2e-16 ***
#> cantonfribourg -781416 48366 -16.16 < 2e-16 ***
#> cantongeneva 2025260 62234 32.54 < 2e-16 ***
#> cantonglarus -573694 173301 -3.31 0.00093 ***
#> cantongrisons 59982 71666 0.84 0.40262
#> cantonjura -801519 77323 -10.37 < 2e-16 ***
#> cantonlucerne -187979 73261 -2.57 0.01030 *
#> cantonneuchatel -353635 65590 -5.39 7.1e-08 ***
#> cantonnidwalden 991055 244826 4.05 5.2e-05 ***
#> cantonobwalden 366062 244712 1.50 0.13470
#> cantonschaffhausen -584997 120601 -4.85 1.2e-06 ***
#> cantonschwyz 18070 132558 0.14 0.89157
#> cantonsolothurn -784557 61024 -12.86 < 2e-16 ***
#> cantonst-gallen -404890 55918 -7.24 4.6e-13 ***
#> cantonthurgau -37337 63444 -0.59 0.55620
#> cantonticino 125913 38499 3.27 0.00108 **
#> cantonuri 9578 155772 0.06 0.95097
#> cantonvalais -219964 39781 -5.53 3.3e-08 ***
#> cantonvaud 89914 40258 2.23 0.02553 *
#> cantonzug 801241 153896 5.21 1.9e-07 ***
#> cantonzurich 316099 49688 6.36 2.0e-10 ***
#> property_typeAttic flat 311019 45964 6.77 1.4e-11 ***
#> property_typeBifamiliar house 41841 42939 0.97 0.32986
#> property_typeChalet 1136804 56690 20.05 < 2e-16 ***
#> property_typeDuplex -5091 56699 -0.09 0.92846
#> property_typeFarm house 237939 118848 2.00 0.04529 *
#> property_typeLoft 285442 291977 0.98 0.32827
#> property_typeRoof flat 4801 64587 0.07 0.94074
#> property_typeRustic house -281265 249068 -1.13 0.25880
#> property_typeSingle house 389066 24252 16.04 < 2e-16 ***
#> property_typeTerrace flat 88662 87071 1.02 0.30856
#> property_typeVilla 1278283 38187 33.47 < 2e-16 ***
#> year_category1919-1945 10462 61602 0.17 0.86515
#> year_category1946-1960 76025 57261 1.33 0.18429
#> year_category1961-1970 232055 48444 4.79 1.7e-06 ***
#> year_category1971-1980 210609 43422 4.85 1.2e-06 ***
#> year_category1981-1990 237789 43679 5.44 5.3e-08 ***
#> year_category1991-2000 477554 45385 10.52 < 2e-16 ***
#> year_category2001-2005 519338 55369 9.38 < 2e-16 ***
#> year_category2006-2010 591351 48030 12.31 < 2e-16 ***
#> year_category2011-2015 724194 47219 15.34 < 2e-16 ***
#> year_category2016-2024 641233 36926 17.37 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 1240000 on 21363 degrees of freedom
#> (72 observations deleted due to missingness)
#> Multiple R-squared: 0.323, Adjusted R-squared: 0.321
#> F-statistic: 216 on 47 and 21363 DF, p-value: <2e-16
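To read the coefficient table as a prediction rule, the fitted price is the intercept, plus the room coefficient times the number of rooms, plus the coefficient of each matching factor level. The Python sketch below does that arithmetic with a few coefficients copied from the output above. It assumes that the levels missing from the table (aargau, Apartment, 0-1919) are the reference categories with coefficient 0, as with R's default treatment contrasts; the function name `predict_price` and the example inputs are hypothetical.

```python
# Coefficients copied from the lm() summary above (subset only).
INTERCEPT = -677158
COEF_ROOMS = 337946

# Assumption: levels absent from the summary are reference levels (effect 0).
CANTON = {"aargau": 0, "bern": -478376, "geneva": 2025260, "zurich": 316099}
PROPERTY_TYPE = {"Apartment": 0, "Single house": 389066, "Villa": 1278283}
YEAR_CATEGORY = {"0-1919": 0, "1991-2000": 477554, "2016-2024": 641233}


def predict_price(rooms, canton, property_type, year_category):
    """Fitted price: intercept + rooms effect + one effect per factor level."""
    return (
        INTERCEPT
        + COEF_ROOMS * rooms
        + CANTON[canton]
        + PROPERTY_TYPE[property_type]
        + YEAR_CATEGORY[year_category]
    )


# Hypothetical inputs: a 4-room apartment built in 2016-2024.
geneva = predict_price(4, "geneva", "Apartment", "2016-2024")
bern = predict_price(4, "bern", "Apartment", "2016-2024")
print(geneva, bern, geneva - bern)
```

With all other inputs equal, the difference between two cantons is the difference of their coefficients. With a multiple R-squared of 0.323, such fitted values leave most of the price variation unexplained.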
5 Supervised learning
- Data splitting (if a training/test set split is enough for the global analysis, at least one CV or bootstrap must be used)
- Two or more models
- Two or more scores
- Tuning of one or more hyperparameters per model
- Interpretation of the model(s)
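As a starting point for the data splitting and scoring items above, here is a minimal Python sketch of a k-fold cross-validation split together with two regression scores (RMSE and MAE). It uses only the standard library; the function names, the seed, and the fold-assignment scheme are illustrative assumptions, not the setup chosen for this project.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


for train_idx, test_idx in k_fold_indices(n_samples=10, k=5):
    print(len(train_idx), len(test_idx))
```

Each observation lands in exactly one test fold, so averaging a score over the folds uses every row once for evaluation.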
6 Unsupervised learning
- Clustering and/or dimension reduction
7 Conclusion
- Brief summary of the project
- Take home message
- Limitations
- Future work?